I grew up playing baseball as well as other sports ever since I was 5 years old and played all the way through high school. Once my playing career started to come to a close, I realized I truly did not love playing baseball as well as sports in general, I was just always interested in the competition and comradery that comes with being on a team. When I came into college I started out not knowing what I wanted to do, I always wanted to do something with science and technology but could not pinpoint exactly what I wanted to do for the rest of my life. After some trial and error exploring different fields I eventually discovered the Data Science career, immediately I was hooked. The idea of using data that comes from our world to gain insights, as well as hearing about machine learning which at my first glance seemed like magic, using previous data to make accurate predictions about the future, why is nobody talking about this? This was back in 2020, so data science was just becoming a big thing and no way near as big as it is now. After 3 years I could confidently say that yup people indeed start hearing and talking about this. Once I went farther into my college career, I gained an understanding of this "magic" that I was witnessing as a young little freshman.
Over the summer, I saw an email being sent out about an open position working as a research analyst for the UCSB Baseball team. This seemed like a perfect internship for myself, as I still have my passion for baseball, as well as professional sports in general. My dream job after all is to work for a professional sports team, preferably for baseball or football, as those are my two favorite sports. A couple of weeks after submitting my application, I received an email saying I was selected for an interview. I was excited about the news and prepared my responses for the generic questions such as tell me about yourself, or what is a project you have worked on. I left the interview feeling confident about my answers, and a week later I got an email saying I was picked for the job. The first meeting was a general introduction to the team and talking about what we will be working on. The team leader informed us that the plan is to do an independent research project that was recommended by the coaching staff of the baseball team.
The project I was selected for involved looking at advanced statistics in used in professional baseball and trying to find a way to come up with helpful predictors to decide whether a player is labeled as a good or bad performer. The task seemed very ambitious, so I started to do some research of my own to see how other baseball statisticians / data scientist have worked on a similar project. The first thing I started to see was the importance of WAR, no we are not talking about people fighting each other in battle but the statistics Wins After Replacement.
WAR was invented in 2009, but was not heavily used until 2012 in the MLB (Major League Baseball). A little information about WAR.
WAR, which stands for Wins Above Replacement, is a single statistic used in baseball to measure a player's overall contribution to their team's success. It quantifies how much better a player is compared to an average or replacement-level player.
In simpler terms, WAR tells you how many additional wins a player brings to their team compared to a hypothetical player you could easily replace from the minor leagues or waivers. A higher WAR value means a player is more valuable to their team's success. It's a way to gauge a player's all-around performance, considering hitting, fielding, pitching (for pitchers), and base-running, in a single number.these two articles talk more about what the replacement word means in WAR.
These two articles go deeper into what exactly the "replacment player" is when looking at WAR.
https://library.fangraphs.com/misc/war/replacement-level/
https://blogs.fangraphs.com/unifying-replacement-level/
To explain it simply the replacement represents replacing a replacement level player, a replacement level player is defined as a player whose WAR calculation comes out to 0 or also defined as, the level of production you could get from a player that would cost you nothing but the league minimum salary to acquire.
To start out the calculation for WAR is pretty simple
Batting Runs (BR): Quantifies the number of runs a player contributes through battins such as RBIs, Hits, and Home Runs
Base Running Runs (BRR): It quantifies the number of runs a player contributes through baserunning, which includes things like stealing bases, taking extra bases, and avoiding double plays.
Fielding Runs (FR): This measures a player's defensive performance, including their ability to field ground balls, make accurate throws, and cover their position effectively.
Positional Adjustment (PA): This factor reflects the relative value of a player's position. For example, a shortstop is considered more valuable than a first baseman because of the defensive demands of the position.
League Adjustment (LA): This accounts for the overall level of competition in the league. A player in a highly competitive league may receive a higher adjustment than a player in a less competitive league.
Replacement Level Runs (RLR): This represents the value a team would lose if the player were replaced by a freely available minor league or waiver wire player.
Runs Per Win: This represents the number of runs a player needs to contribute above a replacement-level player in order to be credited with one additional win for their team.
The problem is, all these statistics are complicated to calculate themselves. To star out the way I did these calculations involved me using a website called Trumedia. Trumedia is what our baseball team uses to store and anaylze statistics for any college baseball player / team in the country. Through Trumedia they have a way to make new statistics based on other ones, so I used this feature to create all the following statistics.
We already have the wOBA for players as well as PA on trumedia. To calculate the lgwOBA I went on trumedua and filtered the data to be for the Big West conference, obtaining a value of .347. To find the wOBA scale I found an article going through how to calculate this (https://rfrey22.medium.com/collegiate-linear-weights-f0237cf40451).
To put wOBA on a scale to look like OBP, we need to treat an out as zero, so we add .33 runs to each event so that we have run values relative to an out, which gives us
Walk: 0.64
Hit by pitch: 0.66
Single: .79
Double: 1.12
Triple: 1.40
Home run: 1.74
We multiply and add each value above with the number of events that occurred during the season, then divide by plate appearances (removing IBBs) to get the denominator. The league wOBA is really scaled to OBP, again, minus the intentional walks. The Division 1 wOBA in 2019 was .363 and the denominator was approximately .304. Divide the wOBA by the denominator and we get 1.194, after rounding. This is the wOBA Scale, which is used for two things, calculating weighted Runs Above Average (wRAA) and used to calculate the final linear weights.
When plugging in our own data we get a wOBA Scale of 1.127, getting our wRAA (Batting runs) table as the following.
We already see some players who are considered star players leading the team in wRAA such as Jared Sundstrom and Aaron Parker.
Looking at the equation for Baserunning Runs (BSR) it might look pretty simple just one thing plus another, the problem is these things are quite tricky to figure out. We first need to look at the lgGDP which is the amount of double plays the league has grounded into divided by lgGDPo which is the total amount of double play opportunities the whole league has had. A double play opportunity is defined as when a runner is on 1st base with < 2 outs. GDPo and GDP are the same but with no lg in front, indicating this is for the player individually not the league as a whole. lgRPO is the league runs per out, meaning how many outs a normal team in the league gets before getting a run. After calculating these values we get
lgGDPo = 19,015
lgGDP = 325
lgRPO = .228
Looking at Part 2 we now are looking at calculating wSB which we are now looking at some unknowns which are runSB, runCS, and lgwSB. runSB is a constant which is 0.2, runCS is calculated using lgRPO which we found before, and lgwSB is the league-wide weighted stolen bases. Having weighted stolen bases allows for measuring the number of runs added above/below average through stolen bases, based on successful steals and times caught stealing. This means we are easily able to see if a player is better or worse than the average player at stealing bases. Once we have lgwSB we can plug it into our original equation for wSB. After plugging in all of our values, we get the following for our team.
Now that we have wGDP and wSB we just do some simple addition to get BSR
Now we look at an important part of the WAR calculation, Positional Adjustment Runs (Rpos). Rpos is important because as we know some positions are harder to play in baseball than others, for example looking at 1B compared to SS we see that a vast majority of players in the MLB could probably play a game or two at first base, and you would maybe not even notice, whereas if a left fielder came and played short stop it would be very appeasement that player is not comfortable at that position.
Using this formula allows us to put weights onto what position the player is playing, which is important for our calculation.
We see not just our best hitters being on the top of the list, but players who play difficult positions such as catcher and short stop.
Replacement level runs (RLR) is looking at how many more/less runs a player contributes than a replacement player. The calculation for this is not very complicated, as we are just looking at how many runs a player will contribute over all of his plate appearances compared to the total amount of plate appearances in the league.
Finally, we got all we need to look at the all encompassing stat WAR. When looking at WAR, the formula is very simply just adding all these stats that took hours and hours of calculating and doing some basic addition and division.
But if we look at all the things needed to calculate these 5 singular statistics, our equation looks like.
(((([wOBA] - .347) / (1.12689705)) [PA]) + (((((325 / 19015) ([AB] =)) - [GDP]) .228) + (([SB] .2) + ([CS] (-.381)) - (0.003 ([1B] + [BB] + [HBP] - [IBB])))) + (((([GS_C] 27 9) + ([GS_1B] 27 -9.5) + ([GS_2B] 27 3) + ([GS_3B] 27 2) +([GS_SS] 27 7) + ([GS_LF] 27 -7) + ([GS_CF] 27 2.5) + ([GS_RF] 27 -7) +([G_DH] 4 -15)) / (1350))) + ((((0.235 15404) (((9 ([RS] / ([G] 27)) 1.5) + 3)) [PA]) / 617113))) / (((9 ([RS] / ([G] 27)) * 1.5) + 3))
Now we can look at who on our team or as a matter of fact any team in The Big West to see how many wins a player adds over a replacement level player.
Going back to what WAR actually means, if we look at the player with the highest WAR on our team last year, we see Aaron Parker with 7.098. This basically means if UCSB replaced Aaron parker with a different catcher that has a WAR of 0 basically meaning he would be a very average college baseball catcher we would lose 7 more games on average. So it is safe to say that Aaron Parker helps the Gauchos win games.
We can look at all college baseball players, not just ones on our team, to see who were some of the best players in the country according to WAR.
If we look at who was drafted in the first round of the MLB draft we see some of the same names that led the country in WAR, meaning that WAR is definitely a good advanced metric to look at when valuing how good a player, specifically how many wins could you expect to get by adding that player to your team.
Another important thing to look at is a team cumulative, where we can get a good picture into seeing how many wins does a team add over a team completely filled with replacement level players. I created a chart with data wrapper that illustrates the cumulative WAR for position players only. Position players are just players who are not Pitchers so pretty this calculation does not take into account how good pitchers are on the team which is a whole other thing that Trumedia has some limitations on, so I was not able to calculate WAR for pitchers
So it is safe to say the Gauchos are one of the best teams in The Big West, as we hope to take home the title in the 2023-24 season. Now that we have WAR calculated as well as other advanced statistics we can import the new dataset with our new metrics to do some data analysis as well as looking to build a model to predict these advanced metrics.
# Packages
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import export_graphviz, plot_tree
from tabulate import tabulate
# Ignore warnings message
import warnings
warnings.filterwarnings('ignore')
# Data
df_2023_player = pd.read_csv('UCSB_2023Season_Data.csv')
df_2023_season = pd.read_csv('UCSB_2023Season_Team_Data.csv')
df_d1 = pd.read_csv('D1_Baseball.csv')
df_BigWest = pd.read_csv('BigWest.csv')
pitch_data = pd.read_csv('Pitch_Data.csv')
UCSB_Stats = pd.read_csv('UCSB_stats.csv')
The 2 main data sets we will be looking at are df_d1 which is hitters statistics for all of Division 1 baseball and `pitch_data
As stated before, all of this data comes from Trumedia which is what is used at our baseball program as well as many across the country to analyze statistics
pitch_data.head()
| Count | Outs | xAVG | HorzBrk | VertBrk | IndVertBrk | Pitcher | Batter | Type | PitchResult | Vel | Spin | ExitVel | LaunchAng | pCallStrk% | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-2 | 1 | 0.8 | -20.2 | -41.8 | 7.6 | T. Weston (CP) | J. Sundstrom (UCSB) | Changeup | Single on a Line Drive | 77.9 | 1752.0 | 87.3 | 19.3 | 14.50% |
| 1 | 0-2 | 2 | 0.95 | -15 | -39.9 | 9.7 | T. Weston (CP) | A. Parker (UCSB) | Changeup | Double on a Line Drive | 76.5 | 1732.2 | 98.7 | 14.8 | 25.00% |
| 2 | 1-2 | 2 | 0.205 | -14.3 | -24.6 | 14.9 | T. Weston (CP) | Z. Darby (UCSB) | Sinker | Ground Out | 85.5 | 2327.3 | 100.2 | -15.5 | 96.90% |
| 3 | 2-2 | 0 | 0.562 | -16.6 | -27 | 13.8 | T. Weston (CP) | C. Kirtley (UCSB) | Changeup | Fly Out | 84.3 | 2284.2 | 101.7 | 32.7 | 25.80% |
| 4 | 0-0 | 0 | 0.089 | -8.8 | -39.4 | 6.9 | T. Weston (CP) | N. Oakley (UCSB) | Changeup | Single on a Ground Ball | 78.3 | 2383.5 | 86.2 | -28.3 | 97.40% |
Count: The current count of the at bat when the pitch took place, a count of 1-2 means 1 ball 2 strikes.
Outs: Amount of outs the defense has currently
xAVG: The xAVG that Trumedia predicts based on the pitch
HorzBrk: Horizontal Break, The distance, measured in inches, between where the pitch actually crosses the front of home plate side-wise, and where it would have crossed home plate side-wise if had it traveled in a perfectly straight line from release. Positive = toward 3rd base. Negative = toward 1st base
VertBrk: Vertical Break, The distance, measured in inches, between where the pitch actually crosses the front of home plate height-wise, and where it would have crossed home plate height-wise if had it traveled in a perfectly straight line from release, completely unaffected by gravity
IndVertBrk: Induced Vertical Break, The distance, measured in inches, between where the pitch actually crosses the front of home plate height-wise, and where it would have crossed home plate height-wise if had it traveled in a perfectly straight line from release, but affected by gravity
Pitcher: The pitcher pitching currently
Batter: The player up to bat
Type: The type of pitch thrown
PitchResult: Detailed description of what happened on the pitch
Vel: The velocity of the pitch thrown
Spin: The amount of times the ball spun from time of leaving the pitcher's hand and crossing home plate
ExitVel: The exit velocity of the ball coming off the bat
LaunchAng: The angle the ball comes off the bat
pCallStrike%: The probability that the ball would have been called a strike by the umpire if the player did not swing at it
df_d1_2 = df_d1.drop(columns = ['Rank','newestTeamLevel', 'playerId', 'abbrevName', 'player', 'newestTeamAbbrevName', 'newestTeamId', 'newestTeamLocation'])
df_d1_2.head()
| playerFullName | pos | newestTeamName | batsHand | throwsHand | fWAR | xWOBADf | wOBA | xWOBA | xAVG | xOBP | OBP | AVG | AB | RPW | RLR | rPOS | BSR | wRAA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dylan Crews | CF | Louisiana State University | R | R | 17.620 | .110 | .530 | .421 | .286 | .451 | .558 | .425 | 252 | 3.70 | 7.15 | 3.45 | 0.826 | 53.689 |
| 1 | Brock Wilken | 3B | Wake Forest University | R | R | 16.693 | .140 | .534 | .394 | .234 | .421 | .502 | .342 | 234 | 3.67 | 6.75 | 2.64 | 0.104 | 51.844 |
| 2 | Nolan Schanuel | 1B | Florida Atlantic University | L | R | 15.557 | .173 | .594 | .421 | .285 | .487 | .603 | .449 | 196 | 3.58 | 5.82 | -10.96 | 0.137 | 60.767 |
| 3 | Charlie Pagliarini | 3B | Fairfield University | L | R | 15.555 | .175 | .560 | .385 | .263 | .425 | .530 | .399 | 193 | 3.59 | 5.28 | 0.47 | 2.630 | 47.408 |
| 4 | Kyle Teel | C | University of Virginia | L | R | 15.164 | .131 | .481 | .351 | .277 | .362 | .476 | .409 | 257 | 3.51 | 6.09 | 11.70 | 0.175 | 35.227 |
We already know what these stats are from the section on WAR
One good thing about our data is for all the datasets we are looking at, there is NO missing data.
print(df_d1_2.isnull().sum())
print(pitch_data.isnull().sum())
playerFullName 0 pos 0 newestTeamName 0 batsHand 0 throwsHand 0 fWAR 0 xWOBADf 0 wOBA 0 xWOBA 0 xAVG 0 xOBP 0 OBP 0 AVG 0 AB 0 RPW 0 RLR 0 rPOS 0 BSR 0 wRAA 0 dtype: int64 Count 0 Outs 0 xAVG 0 HorzBrk 0 VertBrk 0 IndVertBrk 0 Pitcher 0 Batter 0 Type 0 PitchResult 0 Vel 0 Spin 0 ExitVel 0 LaunchAng 0 pCallStrk% 0 dtype: int64
One thing I initially noticed when looking at players statistics was that their x stats such as xwOBA, xAVG, and xOBP were much lower than what their wOBA, AVG, and OBP was. Some background info for these x stats the x stands for expected. The MLB website defines xAVG (also known as xBA) as
"Expected Batting Average (xBA) is a Statcast metric that measures the likelihood that a batted ball will become a hit.
Each batted ball is assigned an xBA based on how often comparable balls -- in terms of exit velocity, launch angle and, on certain types of batted balls, Sprint Speed -- have become hits since Statcast was implemented Major League wide in 2015. (As of January 2019, xBA now factors in a batter's seasonal Sprint Speed on "topped" or weakly hit" balls).
For example, a line drive to the outfield with an xBA of .700 is given that figure because balls with a similar exit velocity and launch angle have become hits seven out of 10 times."
If you did not know much about baseball you might think expected stats are nonsensical, because after all, why does it matter if a player was expected to hit .300 in the season if he only actually hit .260 . The answer to this is hitting a baseball is VERY hard and if you go up to bat and hit a ball that goes 105 mph at a 17 degree launch angle it doesn't matter if the ball ends up going right at a player for an out, the fact you hit the ball 105mph at a 17-degree angle means you did VERY good that at bat. As the example says you might hit a ball very good that gets labeled as an xBA of .700 meaning if you hit a ball at that same exact velocity and angle you would get a hit 70% of the time which for professional baseball is AMAZING
I then went ahead and visualized just how many players had a positive difference in their x stats and their real statistics
def custom_converter(value):
try:
return float(value)
except ValueError:
if value == '-':
return float('-0')
else:
return None
df_d1_2['xAVG'] = df_d1_2['xAVG'].apply(custom_converter)
df_d1_2['AVG'] = df_d1_2['AVG'].apply(custom_converter)
#df_d1_2['xAVG_diff'] = df_d1_2['xAVG_diff'].apply(custom_converter)
df_d1_2['xAVG_diff'] = df_d1_2['AVG'] - df_d1_2['xAVG']
# Select only columns we want
plot_data = df_d1_2[['xAVG_diff']]
# Count positive and negative difference
def count_pos_neg(column):
positive_count = (column > 0).sum()
negative_count = (column < 0).sum()
return pd.Series({'Positive': positive_count, 'Negative': negative_count})
# Apply the function to each column
plot_data2= plot_data.apply(count_pos_neg)
# Plot the bar chart
ax = plot_data2.plot(kind='bar', rot=0, color= 'lightblue', width=.8)
ax.set_ylabel('Count')
ax.set_title('Count of Positive and Negative Difference For xAVG In D1 Baseball')
for p in ax.patches:
ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 5), textcoords='offset points')
# Display the plot
plt.show()
The statistic I looked at was AVG. A positive difference means that the player's actual AVG was greater than what their expected AVG was. In real life terms, this means the xAVG underpredicted $\frac{2622}{2850}$ = 92% of player's real AVG. This is when I knew there had to be something interested going on, since it does not seem accurate that 93% of players are doing better than what their expected stats were.
I then applied a filter to where the difference had to be at least 0.1 since it does make sense to count a player doing 0.3 better than their xAVG to be the same as someone only doing 0.001 better than their xAVG
# Filter to be >= 0.1
plot_data = plot_data[plot_data['xAVG_diff'] >= .1]
# Apply the function to each column
plot_data2= plot_data.apply(count_pos_neg)
# Plot the bar chart
ax = plot_data2.plot(kind='bar', rot=0, color= 'lightblue', width=.8)
ax.set_ylabel('Count')
ax.set_title('Count of Positive and Negative Difference Of At Least 0.1 For xAVG In D1 Baseball')
for p in ax.patches:
ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 5), textcoords='offset points')
# Display the plot
plt.show()
Now we see that there are 0 players in D1 baseball that did at least 0.1 worse than their xAVG. D1 baseball is a lot of teams, so then I went ahead and only looked at The Big West conference
df_BigWest['xAVG'] = df_BigWest['xAVG'].apply(custom_converter)
df_BigWest['AVG'] = df_BigWest['AVG'].apply(custom_converter)
df_BigWest['xAVG_diff'] = df_BigWest['AVG'] - df_BigWest['xAVG']
# Select only columns we want
plot_data = df_BigWest[['xAVG_diff']]
# Apply the function to each column
plot_data2= plot_data.apply(count_pos_neg)
# Plot the bar chart
ax = plot_data2.plot(kind='bar', rot=0, color= 'lightblue', width=.8)
ax.set_ylabel('Count')
ax.set_title('Count of Positive and Negative Difference For xAVG In Big West')
for p in ax.patches:
ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 5), textcoords='offset points')
# Display the plot
plt.show()
Applying the same 0.1 filter
plot_data = plot_data[plot_data['xAVG_diff'] >= .1]
# Apply the function to each column
plot_data2= plot_data.apply(count_pos_neg)
# Plot the bar chart
ax = plot_data2.plot(kind='bar', rot=0, color= 'lightblue', width=.8)
ax.set_ylabel('Count')
ax.set_title('Count of Positive and Negative Difference Of At Least 0.1 For xAVG In Big West')
for p in ax.patches:
ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 5), textcoords='offset points')
# Display the plot
plt.show()
We see the same thing as before, where 0 players are underperforming their xAVG by at least 0.1
We see from the chart that the expected AVG stat has vastly more positive values than negative, this is interesting because we get an idea that expected stats are not a good statistics to look at. For all of D1 baseball $\frac{2622}{2850}$ = 92% and for The Big West $\frac{88}{99}$ = 89.8% of the time players exceed their expected wOBA statistic. One key thing I would like to look at is trying to make a model which can calculate expected stats better than the ones already setup. As we can see 92% of players preformed better than what their expected stats were predicted to be, meaning that whatever method they are using to calculate the expected stats is not too good.
Trumedia has data on pitch by pitch stats for every player at bat. We can filter the data for being balls that the batter made contact on to do some data analysis on seeing what type of pitches become hits.
pitch_data.head()
| Count | Outs | xAVG | HorzBrk | VertBrk | IndVertBrk | Pitcher | Batter | Type | PitchResult | Vel | Spin | ExitVel | LaunchAng | pCallStrk% | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-2 | 1 | 0.8 | -20.2 | -41.8 | 7.6 | T. Weston (CP) | J. Sundstrom (UCSB) | Changeup | Single on a Line Drive | 77.9 | 1752.0 | 87.3 | 19.3 | 14.50% |
| 1 | 0-2 | 2 | 0.95 | -15 | -39.9 | 9.7 | T. Weston (CP) | A. Parker (UCSB) | Changeup | Double on a Line Drive | 76.5 | 1732.2 | 98.7 | 14.8 | 25.00% |
| 2 | 1-2 | 2 | 0.205 | -14.3 | -24.6 | 14.9 | T. Weston (CP) | Z. Darby (UCSB) | Sinker | Ground Out | 85.5 | 2327.3 | 100.2 | -15.5 | 96.90% |
| 3 | 2-2 | 0 | 0.562 | -16.6 | -27 | 13.8 | T. Weston (CP) | C. Kirtley (UCSB) | Changeup | Fly Out | 84.3 | 2284.2 | 101.7 | 32.7 | 25.80% |
| 4 | 0-0 | 0 | 0.089 | -8.8 | -39.4 | 6.9 | T. Weston (CP) | N. Oakley (UCSB) | Changeup | Single on a Ground Ball | 78.3 | 2383.5 | 86.2 | -28.3 | 97.40% |
It is well known in the baseball world that higher exit velocity tends to more hits, this is simply because no matter what a hard hit ball is always harder to catch than a slow hit ball. Even if you hit the ball directly to a player, if you hit the ball 110 mph it is a lot harder to catch than a ball hit 70 mph. One thing we can see at first is for the PitchResult column gives a detailed description of what happened on the pitch. We need to simplify this to one column that is easy to read values indicating if a pitch was a hit or an out
def categorize_pitch_result(row):
keywords = ['Out', 'Error', "Fielder's Choice"]
for keyword in keywords:
if keyword in row['PitchResult']:
return "Out"
return 'Hit'
pitch_data['Hit_Out'] = pitch_data.apply(lambda row: categorize_pitch_result(row), axis=1)
pitch_data.groupby('Hit_Out').count()['Count']
Hit_Out Hit 402 Out 574 Name: Count, dtype: int64
In baseball, a Fielder's Choice occurs when a player hits a ball, typically a ground ball, presenting a scenario where the defense could have potentially gotten the batter out. However, they opt to make a strategic play to get another player on base out instead. For instance, imagine a runner on first base. The batter hits a ground ball toward the shortstop. Instead of attempting to directly retire the batter, the shortstop chooses to throw the ball to the second baseman covering second base, effectively forcing out the runner coming from first. Consequently, the batter-runner is allowed to reach first base safely.
On the other hand, an error refers to a defensive play where the team "should have" successfully gotten the player out, but due to a misplay, the batter reaches base safely. For example, if a batter hits a soft ground ball to the shortstop, and the shortstop mishandles the ball, the batter attains first base safely. In essence, an error is characterized by a lapse in defensive execution that grants the offensive player an opportunity to advance on the bas
def convert_percentage_to_integer(percentage_string):
if '-' in percentage_string:
return 0
# Remove the percentage sign and convert to integer
return float(percentage_string.strip('%'))
pitch_data['pCallStrk%'] = pitch_data['pCallStrk%'].apply(convert_percentage_to_integer)
Lets look at a basic histogram of exit velocity
# Make plot
fig = px.histogram(pitch_data, x='ExitVel', nbins=12, title='Exit Velocity Histogram',
labels={'ExitVel': 'Exit Velocity (MPH)'},
opacity=0.7, color_discrete_sequence=['lightblue'])
# Add black outline to distinguish each bar easily
fig.update_traces(marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
paper_bgcolor='rgb(17,17,17)',
plot_bgcolor='rgb(17,17,17)',
font=dict(color='white'),
title=dict(font=dict(color='white')),
xaxis=dict(tickfont=dict(color='white')),
yaxis=dict(tickfont=dict(color='white')),
)
# Show the plot
fig.show()
We are interested in seeing if a ball is a hard hit ball which the MLB dictates as a ball with an exit velocity of at least 95mph tends to lead to more hits than balls that are not hard hit balls.
# Filter data to see only hard hit balls ( >= 95mph)
plot_data2 = pitch_data[pitch_data['ExitVel'] >= 95]
fig2 = px.pie(plot_data2, names='Hit_Out', title='Hard Hit Balls Outcome',
labels={'Hit_Out': 'Outcome'}, color='Hit_Out')
# Customize the layout to enhance the appearance
fig2.update_layout(
paper_bgcolor='rgb(17,17,17)',
font=dict(color='white'),
title=dict(font=dict(color='white')),
)
# Show the plot
fig2.show()
# Filter data to see only hard hit balls ( >= 95mph)
plot_data3 = pitch_data[pitch_data['ExitVel'] < 95]
fig3 = px.pie(plot_data3, names='Hit_Out', title='Non Hard Hit Balls Percentage by Hit/Out',
labels={'Hit_Out': 'Outcome'}, color='Hit_Out')
# Customize the layout to enhance the appearance
fig3.update_layout(
paper_bgcolor='rgb(17,17,17)',
font=dict(color='white'),
title=dict(font=dict(color='white')),
)
# Show the plot
fig3.show()
We can see visually that when balls are not hard hit balls, they tend to be outs vastly more than if they were hard hit. Now we want to look at correlations between our statistics.
# Select columns we want
correlation_df = pitch_data.drop(columns = ['Count', 'Type', 'PitchResult'])
# Fix columns to be integers
correlation_df['VertBrk'] = correlation_df['VertBrk'].apply(custom_converter)
correlation_df['IndVertBrk'] = correlation_df['IndVertBrk'].apply(custom_converter)
correlation_df['HorzBrk'] = correlation_df['HorzBrk'].apply(custom_converter)
correlation_df['xAVG'] = correlation_df['xAVG'].apply(custom_converter)
# Dummy code Hit_Out
category_mapping = {'Hit': 1, 'Out': 0}
correlation_df['Hit_Out'] = correlation_df['Hit_Out'].map(category_mapping)
# Make correlation plot
correlation_matrix = correlation_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Plot')
plt.show()
We start to see some pattern that if you were an experienced baseball fan/player would seem not out of the ordinary. For example, any baseball player knows that more velocity on a pitch means it will break more. This is not new knowledge, and it would be quite redundant to build a model to try and predict if a pitch is a hit or not based on exit velocity and launch angle because it is very obvious to a baseball fan that this is the case. To help develop some new predictors for our model, we will need to try and create some new stats based on our data.
We all know that hitting a baseball is a very hard thing to do even if the pitcher does not throw a good pitch, so it is just that much harder if it is a good pitch. This is why I want to develop a formula to decide whether a pitch is labeled as a good pitch or not because it should not be weighted the same if a player hits a slow 85 mph fast ball right down the middle for a hit whereas someone else might hit a 102 mph fastball on the corner of a strike zone for a hit.
pitch_data.head()
| Count | Outs | xAVG | HorzBrk | VertBrk | IndVertBrk | Pitcher | Batter | Type | PitchResult | Vel | Spin | ExitVel | LaunchAng | pCallStrk% | Hit_Out | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-2 | 1 | 0.8 | -20.2 | -41.8 | 7.6 | T. Weston (CP) | J. Sundstrom (UCSB) | Changeup | Single on a Line Drive | 77.9 | 1752.0 | 87.3 | 19.3 | 14.5 | Hit |
| 1 | 0-2 | 2 | 0.95 | -15 | -39.9 | 9.7 | T. Weston (CP) | A. Parker (UCSB) | Changeup | Double on a Line Drive | 76.5 | 1732.2 | 98.7 | 14.8 | 25.0 | Hit |
| 2 | 1-2 | 2 | 0.205 | -14.3 | -24.6 | 14.9 | T. Weston (CP) | Z. Darby (UCSB) | Sinker | Ground Out | 85.5 | 2327.3 | 100.2 | -15.5 | 96.9 | Out |
| 3 | 2-2 | 0 | 0.562 | -16.6 | -27 | 13.8 | T. Weston (CP) | C. Kirtley (UCSB) | Changeup | Fly Out | 84.3 | 2284.2 | 101.7 | 32.7 | 25.8 | Out |
| 4 | 0-0 | 0 | 0.089 | -8.8 | -39.4 | 6.9 | T. Weston (CP) | N. Oakley (UCSB) | Changeup | Single on a Ground Ball | 78.3 | 2383.5 | 86.2 | -28.3 | 97.4 | Hit |
One thing to note beforehand is that we are only looking at decisive pitches only, which means it was a pitch where the player put the ball in play. There is no data that comes from pitches that did not result in a hit or out from the pitch.
# Format our data like we did earlier
pitch_data['VertBrk'] = pitch_data['VertBrk'].apply(custom_converter)
pitch_data['IndVertBrk'] = pitch_data['IndVertBrk'].apply(custom_converter)
pitch_data['HorzBrk'] = pitch_data['HorzBrk'].apply(custom_converter)
pitch_data['xAVG'] = pitch_data['xAVG'].apply(custom_converter)
pitch_data.groupby('Hit_Out').mean()
| Outs | xAVG | HorzBrk | VertBrk | IndVertBrk | Vel | Spin | ExitVel | LaunchAng | pCallStrk% | |
|---|---|---|---|---|---|---|---|---|---|---|
| Hit_Out | ||||||||||
| Hit | 0.937811 | 0.516475 | 3.147761 | -27.709453 | 10.707463 | 85.504229 | 2163.394527 | 92.099005 | 10.904478 | 78.348010 |
| Out | 0.961672 | 0.150399 | 2.821951 | -28.233624 | 10.901568 | 85.036760 | 2168.865854 | 86.552787 | 17.086934 | 77.857491 |
We see that for outs the launch angle is usually way higher, indicating that if you hit a fly ball no matter what it is always more likely going to be an out. All the other statistics do not seem to have any effect on whether it is a hit or not. We can now look at the types of pitches and see what results we see
pitch_data.groupby(['Type','Hit_Out']).mean()
| Outs | xAVG | HorzBrk | VertBrk | IndVertBrk | Vel | Spin | ExitVel | LaunchAng | pCallStrk% | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | Hit_Out | ||||||||||
| Changeup | Hit | 0.954545 | 0.553500 | 2.389394 | -31.078788 | 10.237879 | 81.692424 | 1857.763636 | 94.774242 | 13.734848 | 73.498485 |
| Out | 0.970000 | 0.164490 | 6.559000 | -31.131000 | 11.391000 | 81.698000 | 1872.532000 | 86.908000 | 18.157000 | 77.275000 | |
| Curveball | Hit | 0.653061 | 0.495000 | -5.118367 | -53.422449 | -5.830612 | 77.724490 | 2313.042857 | 88.408163 | 9.751020 | 82.369388 |
| Out | 0.906667 | 0.183200 | -7.317333 | -55.109333 | -6.176000 | 76.846667 | 2314.352000 | 89.740000 | 13.440000 | 80.293333 | |
| Cutter | Hit | 1.500000 | 0.443000 | -1.800000 | -27.800000 | 9.550000 | 86.400000 | 2411.400000 | 90.750000 | 19.100000 | 97.400000 |
| Out | 0.833333 | 0.027667 | 0.066667 | -22.066667 | 14.766667 | 86.733333 | 2399.566667 | 89.133333 | 32.933333 | 93.783333 | |
| Fastball | Hit | 0.965278 | 0.495750 | 5.407639 | -16.073611 | 18.152778 | 89.289583 | 2199.782639 | 93.691667 | 11.802778 | 77.384722 |
| Out | 0.946078 | 0.141652 | 4.601961 | -15.997549 | 18.892157 | 89.285784 | 2195.294608 | 88.282843 | 19.660294 | 78.716667 | |
| Sinker | Hit | 0.952381 | 0.490238 | 10.057143 | -22.053571 | 13.391667 | 89.269048 | 2133.464286 | 90.847619 | 9.092857 | 82.572619 |
| Out | 0.957895 | 0.176874 | 10.030526 | -21.314737 | 13.871579 | 89.615789 | 2146.930526 | 83.982105 | 9.522105 | 75.298947 | |
| Slider | Hit | 1.052632 | 0.585667 | -4.585965 | -39.431579 | 2.743860 | 81.463158 | 2332.115789 | 90.042105 | 8.731579 | 76.045614 |
| Out | 1.032967 | 0.111901 | -4.257143 | -37.789011 | 3.196703 | 81.183516 | 2350.084615 | 82.438462 | 18.795604 | 77.029670 | |
| Splitter | Out | 1.333333 | 0.030333 | 2.666667 | -33.400000 | 10.100000 | 80.633333 | 1348.600000 | 78.433333 | 53.633333 | 52.233333 |
What about for just the types of pitches
pitch_data.groupby(['Type','Hit_Out']).mean().groupby('Type').mean()
| Outs | xAVG | HorzBrk | VertBrk | IndVertBrk | Vel | Spin | ExitVel | LaunchAng | pCallStrk% | |
|---|---|---|---|---|---|---|---|---|---|---|
| Type | ||||||||||
| Changeup | 0.962273 | 0.358995 | 4.474197 | -31.104894 | 10.814439 | 81.695212 | 1865.147818 | 90.841121 | 15.945924 | 75.386742 |
| Curveball | 0.779864 | 0.339100 | -6.217850 | -54.265891 | -6.003306 | 77.285578 | 2313.697429 | 89.074082 | 11.595510 | 81.331361 |
| Cutter | 1.166667 | 0.235333 | -0.866667 | -24.933333 | 12.158333 | 86.566667 | 2405.483333 | 89.941667 | 26.016667 | 95.591667 |
| Fastball | 0.955678 | 0.318701 | 5.004800 | -16.035580 | 18.522467 | 89.287684 | 2197.538623 | 90.987255 | 15.731536 | 78.050694 |
| Sinker | 0.955138 | 0.333556 | 10.043835 | -21.684154 | 13.631623 | 89.442419 | 2140.197406 | 87.414862 | 9.307481 | 78.935783 |
| Slider | 1.042799 | 0.348784 | -4.421554 | -38.610295 | 2.970281 | 81.323337 | 2341.100202 | 86.240283 | 13.763592 | 76.537642 |
| Splitter | 1.333333 | 0.030333 | 2.666667 | -33.400000 | 10.100000 | 80.633333 | 1348.600000 | 78.433333 | 53.633333 | 52.233333 |
My approach involves establishing a baseline for what constitutes a commendable pitch within various pitch statistics, such as Vertical Break, Velocity, and more. By computing the average metrics for each type of pitch stat, I aim to define a standard for what can be considered a good pitch. Subsequently, I will implement a function that evaluates individual pitches. If at least three out of the five pertinent statistics surpass their respective means, the pitch will be classified as good. Moreover, if all five statistics exceed their averages, the pitch will receive the exceptional designation of 'great.' Conversely, pitches failing to meet these criteria will be categorized as bad.
Changeup_VertBrk = -31.804266
Changeup_IndVertBrk = 10.862185
Changeup_HorzBrk = 1.684830
Changeup_Vel = 81.417713
Changeup_Spin = 1805.965918
Curveball_VertBrk = -57.236189
Curveball_IndVertBrk = -7.993903
Curveball_HorzBrk = -7.770217
Curveball_Vel = 76.785978
Curveball_Spin = 2260.106003
Cutter_VertBrk = -24.933333
Cutter_IndVertBrk = 12.158333
Cutter_HorzBrk = -0.866667
Cutter_Vel = 86.566667
Cutter_Spin = 2405.483333
Fastball_VertBrk = -15.699860
Fastball_IndVertBrk = 18.425569
Fastball_HorzBrk = 5.792103
Fastball_Vel = 90.245774
Fastball_Spin = 2178.273488
Sinker_VertBrk = -21.825332
Sinker_IndVertBrk = 13.608768
Sinker_HorzBrk = 7.423325
Sinker_Vel = 89.511871
Sinker_Spin = 2170.012930
Slider_VertBrk = -33.400811
Slider_IndVertBrk = 2.064417
Slider_HorzBrk = -1.555598
Slider_Vel = 81.011028
Slider_Spin = 2392.011269
Splitter_VertBrk = -33.400000
Splitter_IndVertBrk = 10.100000
Splitter_HorzBrk = 0.030333
Splitter_Vel = 80.633333
Splitter_Spin = 1348.600000
Changeup
pitch_data_changeup = pitch_data[pitch_data['Type'] == 'Changeup']
pitch_data_changeup['Good_Pitch'] = pitch_data_changeup.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Changeup_VertBrk,
row['IndVertBrk'] >= Changeup_IndVertBrk,
row['HorzBrk'] >= Changeup_HorzBrk,
row['Vel'] >= Changeup_Vel,
row['Spin'] >= Changeup_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Changeup_VertBrk,
row['IndVertBrk'] >= Changeup_IndVertBrk,
row['HorzBrk'] >= Changeup_HorzBrk,
row['Vel'] >= Changeup_Vel,
row['Spin'] >= Changeup_Spin
]) >= 3 else 'Bad'), axis=1)
Curveball
pitch_data_curveball = pitch_data[pitch_data['Type'] == 'Curveball']
pitch_data_curveball['Good_Pitch'] = pitch_data_curveball.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Curveball_VertBrk,
row['IndVertBrk'] >= Curveball_IndVertBrk,
row['HorzBrk'] >= Curveball_HorzBrk,
row['Vel'] >= Curveball_Vel,
row['Spin'] >= Curveball_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Curveball_VertBrk,
row['IndVertBrk'] >= Curveball_IndVertBrk,
row['HorzBrk'] >= Curveball_HorzBrk,
row['Vel'] >= Curveball_Vel,
row['Spin'] >= Curveball_Spin
]) >= 3 else 'Bad'), axis=1)
Cutter
pitch_data_cutter = pitch_data[pitch_data['Type'] == 'Cutter']
pitch_data_cutter['Good_Pitch'] = pitch_data_cutter.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Cutter_VertBrk,
row['IndVertBrk'] >= Cutter_IndVertBrk,
row['HorzBrk'] >= Cutter_HorzBrk,
row['Vel'] >= Cutter_Vel,
row['Spin'] >= Cutter_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Cutter_VertBrk,
row['IndVertBrk'] >= Cutter_IndVertBrk,
row['HorzBrk'] >= Cutter_HorzBrk,
row['Vel'] >= Cutter_Vel,
row['Spin'] >= Cutter_Spin
]) >= 3 else 'Bad'), axis=1)
Fastball
pitch_data_fastball = pitch_data[pitch_data['Type'] == 'Fastball']
pitch_data_fastball['Good_Pitch'] = pitch_data_fastball.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Fastball_VertBrk,
row['IndVertBrk'] >= Fastball_IndVertBrk,
row['HorzBrk'] >= Fastball_HorzBrk,
row['Vel'] >= Fastball_Vel,
row['Spin'] >= Fastball_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Fastball_VertBrk,
row['IndVertBrk'] >= Fastball_IndVertBrk,
row['HorzBrk'] >= Fastball_HorzBrk,
row['Vel'] >= Fastball_Vel,
row['Spin'] >= Fastball_Spin
]) >= 3 else 'Bad'), axis=1)
Sinker
pitch_data_sinker = pitch_data[pitch_data['Type'] == 'Sinker']
pitch_data_sinker['Good_Pitch'] = pitch_data_sinker.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Sinker_VertBrk,
row['IndVertBrk'] >= Sinker_IndVertBrk,
row['HorzBrk'] >= Sinker_HorzBrk,
row['Vel'] >= Sinker_Vel,
row['Spin'] >= Sinker_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Sinker_VertBrk,
row['IndVertBrk'] >= Sinker_IndVertBrk,
row['HorzBrk'] >= Sinker_HorzBrk,
row['Vel'] >= Sinker_Vel,
row['Spin'] >= Sinker_Spin
]) >= 3 else 'Bad'), axis=1)
Slider
pitch_data_slider = pitch_data[pitch_data['Type'] == 'Slider']
pitch_data_slider['Good_Pitch'] = pitch_data_slider.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Slider_VertBrk,
row['IndVertBrk'] >= Slider_IndVertBrk,
row['HorzBrk'] >= Slider_HorzBrk,
row['Vel'] >= Slider_Vel,
row['Spin'] >= Slider_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Slider_VertBrk,
row['IndVertBrk'] >= Slider_IndVertBrk,
row['HorzBrk'] >= Slider_HorzBrk,
row['Vel'] >= Slider_Vel,
row['Spin'] >= Slider_Spin
]) >= 3 else 'Bad'), axis=1)
Splitter
pitch_data_splitter = pitch_data[pitch_data['Type'] == 'Splitter']
pitch_data_splitter['Good_Pitch'] = pitch_data_splitter.apply(lambda row: 'Great' if all([
row['VertBrk'] >= Splitter_VertBrk,
row['IndVertBrk'] >= Splitter_IndVertBrk,
row['HorzBrk'] >= Splitter_HorzBrk,
row['Vel'] >= Splitter_Vel,
row['Spin'] >= Splitter_Spin
]) else ('Good' if sum([
row['VertBrk'] >= Splitter_VertBrk,
row['IndVertBrk'] >= Splitter_IndVertBrk,
row['HorzBrk'] >= Splitter_HorzBrk,
row['Vel'] >= Splitter_Vel,
row['Spin'] >= Splitter_Spin
]) >= 3 else 'Bad'), axis=1)
frames = [pitch_data_slider, pitch_data_sinker, pitch_data_fastball, pitch_data_curveball, pitch_data_cutter, pitch_data_changeup, pitch_data_splitter]
new_pitch_data = pd.concat(frames)
pd.DataFrame(new_pitch_data.groupby(['Good_Pitch', 'Type']).count()['Count'])
| Count | ||
|---|---|---|
| Good_Pitch | Type | |
| Bad | Changeup | 76 |
| Curveball | 35 | |
| Cutter | 5 | |
| Fastball | 175 | |
| Sinker | 77 | |
| Slider | 101 | |
| Splitter | 2 | |
| Good | Changeup | 62 |
| Curveball | 80 | |
| Cutter | 3 | |
| Fastball | 147 | |
| Sinker | 89 | |
| Slider | 45 | |
| Splitter | 1 | |
| Great | Changeup | 28 |
| Curveball | 9 | |
| Fastball | 26 | |
| Sinker | 13 | |
| Slider | 2 |
Lets plot the data to visualize it easily
# Make new df for plotting
plot_data_pitchtype = new_pitch_data.groupby(['Good_Pitch', 'Type']).count()['Count']
plot_data_pitchtype = plot_data_pitchtype.reset_index()
plot_data_pitchtype = plot_data_pitchtype.pivot(index='Type', columns='Good_Pitch', values='Count')
# Make plot
plot_data_pitchtype.plot(kind='bar', stacked=True, colormap='gist_rainbow', figsize=(10, 6))
plt.title('Good Pitch Histogram by Type of Pitch')
plt.xlabel('', size = 15)
plt.ylabel('Count', size=15)
plt.legend(title='Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=0, fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
The distribution of Pitch Type looks good for the conditions we set up. We see a good amount of good pitches which is expected because we play in a good conference which has a lot of good pitchers so we should not be seeing too many bad pitches. As for great pitches, we only see a small amount because it is pretty rare to see a pitcher throw a great pitch, they might only throw one maybe once every inning.
The models we are going to be looking at will be ones that are good for classification problems, since our outcome is predicting whether a pitch will result in a hit or not. There are a lot of models that preform well for classification that I will implement. The method I will be using will not just use classification to predict a hit or out, but more rather to predict the probability of every outcome being a hit. This is because the way xAVG is calculated is getting the likelyhood of a getting a hit based on that batted ball.
Here, the coefficients (β₀ to β₅) represent the impact each variable has on the log-odds of a hit. A positive coefficient indicates the variable increases the likelihood of a hit, while a negative coefficient indicates the opposite. The larger the coefficient's magnitude, the stronger the relationship between the variable and the probability of a hit.
This model allows us to understand the factors influencing the batter's success. For instance, a positive coefficient for LaunchAng suggests that higher launch angles lead to a greater chance of a hit. Similarly, a negative coefficient for pCallStrk% might indicate that batters who see pitches that are less likely to be called a strike are less likely to hit the ball.
First we will need to dummy code our Hit_Out column so it is represented as a 1 when it is a hit and 0 for an out. If our model predictes log-odds that are > 0.5 (50%) then the prediction will be 1, aka a hit
new_pitch_data['Good_Pitch'] = new_pitch_data['Good_Pitch'].map({'Bad': 0, 'Good': 3, 'Great': 5})
new_pitch_data['New_xAVG'] = new_pitch_data['Hit_Out'].map({'Hit': 1, "Out": 0})
Create our train test split. We will be using strattiifed sampling to get even proportions of hit and outs for our train and testing data.
# Define our predictors and response variable
X = new_pitch_data[['ExitVel', 'LaunchAng', 'pCallStrk%', 'Good_Pitch']]
y = new_pitch_data['New_xAVG']
# Split the data into training and testing sets using Stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)
We can use cross validation to see how our logistic regression model works on our data
# Define the hyperparameters we want to tune
param_grid = {
'penalty': ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10, 100]
}
# Create a logistic regression model
logreg_model = LogisticRegression(random_state=42)
# Create a StratifiedKFold cross-validation object
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=logreg_model, param_grid=param_grid,
scoring='accuracy', cv=stratified_cv, verbose=1, n_jobs=-1)
# Standardize your features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.fit_transform(X)
# Fit the GridSearchCV to your data
grid_search.fit(X_train_scaled, y_train)
# Get the best estimator (model with the best hyperparameters)
best_logreg_model = grid_search.best_estimator_
accuracy_scores = cross_val_score(logreg_model, X_scaled, y, cv=stratified_cv, scoring='accuracy')
# Print accuracy scores for each fold
for fold, accuracy in enumerate(accuracy_scores, start=1):
print(f'Fold {fold} Accuracy: {accuracy:.4f}')
# Calculate and print the mean and standard deviation of accuracy scores
cv_accuracy_lr = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)
print(f'Mean Accuracy: {cv_accuracy_lr:.4f}')
print(f'Standard Deviation: {std_accuracy:.4f}')
# Print the best hyperparameters found by GridSearchCV
print('Best Hyperparameters:', grid_search.best_params_)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fold 1 Accuracy: 0.6378
Fold 2 Accuracy: 0.6769
Fold 3 Accuracy: 0.6051
Fold 4 Accuracy: 0.5949
Fold 5 Accuracy: 0.6103
Mean Accuracy: 0.6250
Standard Deviation: 0.0296
Best Hyperparameters: {'C': 0.1, 'penalty': 'l2'}
While logistic regression provides a simplistic probabilistic approach to predicting hit outcomes, another powerful tool in our arsenal is the decision tree. This method offers a distinct perspective for analyzing the factors influencing batting success.
Unlike logistic regression, which models the probability of a hit directly, decision trees employ a hierarchical structure to classify pitches as hits or outs based on a series of logical rules. These rules are created by splitting the data based on the independent variables (ExitVel, LaunchAng, etc.) that best differentiate between hits and outs. From here we can then look at the probabiloty that the pitch will be predicted as a hit based on the leaf we reach in our hierarchical structure
# Vary the max_depth and calculate cross-validated accuracy
max_depth_values = range(1,50)
cv_accuracies_dt = []
for max_depth in max_depth_values:
dt_classifier = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
cv_scores_dt = cross_val_score(dt_classifier, X_train, y_train, cv=5, scoring='accuracy')
cv_accuracies_dt.append(np.mean(cv_scores_dt))
# Find the best max_depth value
best_max_depth = max_depth_values[np.argmax(cv_accuracies_dt)]
# Evaluate the DecisionTreeClassifier model
print(f"Best max_depth for DecisionTreeClassifier: {best_max_depth}")
print("Cross Validation Average Accuracy Score:", np.mean(cv_accuracies_dt))
Best max_depth for DecisionTreeClassifier: 5 Cross Validation Average Accuracy Score: 0.6823652537938253
While both logistic regression and decision trees offer valuable perspectives on predicting hit probability, the random forest model presents an even more powerful approach. This ensemble learning technique combines the strengths of multiple decision trees to achieve improved accuracy and robustness.
Instead of constructing a single decision tree, the random forest algorithm builds a collection of individual trees, each trained on a different subset of the data and using a random subset of features at each split. This diversity helps to avoid overfitting and ensures the final predictions are not overly reliant on any single variable. For each pitch, the random forest model gathers the predictions from all individual trees and combines them using a voting scheme (majority vote for classification, average for regression). This collective prediction represents the combined wisdom of the entire forest, minimizing the impact of individual errors and biases. The randomness injected during training helps to prevent the model from overfitting the training data, leading to better generalization performance on unseen pitches.
We can use cross validation to see how well our model preforms using different values of our number of estimators
# Vary the number of estimators and calculate cross-validated accuracy
estimators_range = range(10,200,10)
cv_accuracies_rf = []
for n_estimators in estimators_range:
rf_classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
cv_scores = cross_val_score(rf_classifier, X_train, y_train, cv=5, scoring='accuracy')
cv_accuracies_rf.append(np.mean(cv_scores))
# Find the optimal number of estimators
optimal_estimators = estimators_range[np.argmax(cv_accuracies_rf)]
# Display results
print(f"Optimal Number of Estimators: {optimal_estimators}")
print(f"Cross-validated Average Accuracy Scores: {np.mean(cv_accuracies_rf)}")
Optimal Number of Estimators: 160 Cross-validated Average Accuracy Scores: 0.7284075573549258
Gradient Boosted Models (GBMs), known for their powerful predictive capabilities. GBMs work by sequentially building an ensemble of decision trees, where each tree focuses on correcting the weaknesses of the previous one. This iterative process leads to a more robust and accurate model compared to individual trees.
The process starts with a simple model, like a shallow decision tree, that makes predictions for the hit probability. The model's errors are analyzed, and gradients are calculated for each data point, indicating how much the model's prediction should be adjusted to improve overall accuracy. A new decision tree is constructed, focusing on learning from the gradients and correcting the predictions for data points with the highest errors. This process of calculating gradients, building new trees, and adjusting predictions continues iteratively until a stopping criterion is met, such as reaching a desired level of accuracy or exceeding a certain number of trees. Similar to random forests, the final prediction for each pitch is obtained by averaging the outputs from all the individual trees in the ensemble.
GBMs often achieve higher accuracy than other models due to their ability to learn complex relationships between features and the target variable. GBMs can handle a wide range of data types and can be easily adapted to incorporate new features or modify the loss function for specific needs. hile not as straightforward as individual decision trees, GBMs can offer insights into the features' importance through techniques like SHAP analysis, helping to understand the factors influencing hit probability. The number of trees in the ensemble can be tuned to control the model's complexity, balancing accuracy with interpretability and computational cost.
To optimize model performance, we'll utilize cross-validationand fine-tune the number of trees for our gradient boosted model
# Create a Gradient Boosted Model
gb_classifier = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1, random_state=42)
# Create a StratifiedKFold cross-validation object with 5 folds
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Setup our range of number of estimators
param_grid = {'n_estimators': range(10,100,)}
# Perform cross-validation and get accuracy scores for each fold
accuracy_scores = cross_val_score(gb_classifier, X_scaled, y, cv=stratified_cv, scoring='accuracy')
# Print accuracy scores for each fold
for fold, accuracy in enumerate(accuracy_scores, start=1):
print(f'Fold {fold} Accuracy: {accuracy:.4f}')
# Create GridSearchCV to find the best n_estimators value
grid_search = GridSearchCV(estimator=gb_classifier, param_grid=param_grid,
scoring='accuracy', cv=stratified_cv, verbose=1, n_jobs=-1)
# Fit the GridSearchCV to your data
grid_search.fit(X_scaled, y)
# Get the best estimator (model with the best n_estimators)
best_gb_model = grid_search.best_estimator_
# Print the best n_estimators value and accuracy score
best_n_estimators = grid_search.best_params_['n_estimators']
best_accuracy = grid_search.best_score_
print(f'Best n_estimators: {best_n_estimators}')
print(f'Best Accuracy: {best_accuracy:.4f}')
# Calculate and print the mean and standard deviation of accuracy scores
mean_accuracy_gb = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)
print(f'Mean Accuracy: {mean_accuracy_gb:.4f}')
print(f'Standard Deviation: {std_accuracy:.4f}')
Fold 1 Accuracy: 0.7908 Fold 2 Accuracy: 0.7282 Fold 3 Accuracy: 0.7487 Fold 4 Accuracy: 0.7231 Fold 5 Accuracy: 0.7590 Fitting 5 folds for each of 90 candidates, totalling 450 fits Best n_estimators: 32 Best Accuracy: 0.7530 Mean Accuracy: 0.7500 Standard Deviation: 0.0243
mean_accuracy_lr = cv_accuracy_lr
mean_accuracy_dt = np.mean(cv_accuracies_dt)
mean_accuracy_rf = np.mean(cv_accuracies_rf)
mean_accuracy_gb
Model = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosted']
accuracy = [mean_accuracy_lr, mean_accuracy_dt, mean_accuracy_rf, mean_accuracy_gb]
data = {'Model': Model, 'Accuracy': accuracy}
model_accuracy_df = pd.DataFrame(data)
# Create a new column with values formatted as percentages
model_accuracy_df['Accuracy'] = model_accuracy_df['Accuracy'].astype(float)
model_accuracy_df['Accuracy'] *= 100
model_accuracy_df['Accuracy'] = model_accuracy_df['Accuracy'].round(2)
# Make plot
fig = px.bar(model_accuracy_df, x='Model', y='Accuracy', text=model_accuracy_df['Accuracy'].map('{:.2f}%'.format), template='plotly_white',
color_discrete_sequence=['lightblue'],
labels={'Accuracy': 'Accuracy (%)'},
title="Model's Mean Accuracy Comparison")
fig.update_layout(
template='plotly_dark',
xaxis=dict(
title=None,
tickfont=dict(size=16),
),
yaxis_title=None,
showlegend=False,
font=dict(color='white'),
plot_bgcolor='rgba(0,0,0,0)',
title_font=dict(size=26, family='Arial', color='white'),
title_x=0.5,
)
fig.show()
# Train the model with the optimal number of estimators
rf_classifier_optimal = RandomForestClassifier(n_estimators=optimal_estimators, random_state=42)
rf_classifier_optimal.fit(X_train, y_train)
# Make predictions and get accuracy score on test set
y_pred_rf = rf_classifier_optimal.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy_rf:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
# Append predicted probabilities to dataset
new_pitch_data['New_xAVG_rf'] = rf_classifier_optimal.predict_proba(X)[:, 1]
# Create dataset to compare AVG, xAVG and our new stat New_xAVG
compare_df = pd.DataFrame(new_pitch_data.groupby('Batter').mean()[['xAVG', 'New_xAVG_rf']]).reset_index()
player_avg_df = pd.DataFrame(new_pitch_data.groupby(['Batter', 'Hit_Out']).count()['Count']).reset_index()
proportions = player_avg_df.pivot(index='Batter', columns='Hit_Out', values='Count').fillna(0)
# Calculate the normal AVG per batter
proportions['AVG'] = proportions['Hit'] / (proportions['Hit'] + proportions['Out'])
proportions = proportions.reset_index()
compare_df = pd.merge(compare_df, proportions, on='Batter', how='inner')
compare_df = compare_df.drop(columns = ['Hit', 'Out'])
# Create final dataset
ab_df = pd.DataFrame(new_pitch_data.groupby('Batter').count()['Count']).rename(columns={'Count': 'AB'})
compare_df = pd.merge(compare_df, ab_df, on = 'Batter', how='inner')
compare_df['Batter'] = compare_df['Batter'].apply(lambda x: ' '.join(x.split(' ', 2)[:2]))
compare_df
Accuracy: 0.72
Classification Report:
precision recall f1-score support
0 0.75 0.79 0.77 115
1 0.68 0.62 0.65 81
accuracy 0.72 196
macro avg 0.71 0.70 0.71 196
weighted avg 0.72 0.72 0.72 196
Confusion Matrix:
[[91 24]
[31 50]]
| Batter | xAVG | New_xAVG_rf | AVG | AB | |
|---|---|---|---|---|---|
| 0 | A. Parker | 0.285472 | 0.428935 | 0.416667 | 108 |
| 1 | B. Mortensen | 0.286111 | 0.426736 | 0.430556 | 72 |
| 2 | C. Kirtley | 0.276450 | 0.347813 | 0.360000 | 100 |
| 3 | C. Nunez | 0.245736 | 0.380307 | 0.377358 | 106 |
| 4 | I. Brethowr | 0.387163 | 0.476962 | 0.511628 | 86 |
| 5 | J. Brown | 0.013000 | 0.000000 | 0.000000 | 1 |
| 6 | J. Newman | 0.331318 | 0.401705 | 0.386364 | 44 |
| 7 | J. Sebring | 0.356365 | 0.455889 | 0.442308 | 52 |
| 8 | J. Sundstrom | 0.370966 | 0.456463 | 0.454545 | 88 |
| 9 | J. Trimble | 0.484083 | 0.488021 | 0.416667 | 12 |
| 10 | J. Williams | 0.249800 | 0.317500 | 0.300000 | 10 |
| 11 | L. Mccollum | 0.314000 | 0.464987 | 0.473118 | 93 |
| 12 | L. Mosby | 0.198778 | 0.315278 | 0.333333 | 9 |
| 13 | N. Oakley | 0.256420 | 0.310236 | 0.289855 | 69 |
| 14 | Z. Darby | 0.260111 | 0.392163 | 0.404762 | 126 |
We can create a chart to see how well our New xAVG compares to the Trumedia xAVG and a players actualt AVG they had in the dataset
# Plotting
fig = px.bar(
compare_df,
x='Batter',
y=['New_xAVG_rf', 'AVG', 'xAVG'],
labels={'value': 'Values', 'variable': 'Statistics'},
width=1000,
height=600,
color_discrete_map={
'New_xAVG_rf': '#3498db',
'AVG': '#F5E834',
'xAVG': 'lightgrey'
},
)
fig.update_layout(
barmode='group',
xaxis=dict(
tickangle=55,
tickmode='array',
tickvals=list(range(len(compare_df['Batter']))),
ticktext=compare_df['Batter'],
tickfont=dict(size=15, color='white', family='Arial')
),
paper_bgcolor='rgba(17,17,17,17)',
plot_bgcolor='rgba(17,17,17,17)',
title=dict(text='<b>Comparison between New xAVG, AVG, and xAVG for Random Forest Model</b>', font=dict(color='white', size=18),x=0.45,y=0.93, xanchor='center'),
legend=dict(title=dict(font=dict(color='white'))),
yaxis=dict(tickfont=dict(color='white')),
)
fig.update_traces(marker_line_color='black', marker_line_width=2)
fig.show()
# Fit our model and make predictions on the test set
best_gb_model.fit(X_train, y_train)
y_pred = best_gb_model.predict(X_test)
# Evaluate the model
accuracy_gb = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy_gb:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Append predicted probabilities to dataset
new_pitch_data['New_xAVG_gb'] = best_gb_model.predict_proba(X)[:, 1]
# Create dataset to compare AVG, xAVG and our new stat New_xAVG
compare_df2 = pd.DataFrame(new_pitch_data.groupby('Batter').mean()[['xAVG', 'New_xAVG_gb']]).reset_index()
player_avg_df2 = pd.DataFrame(new_pitch_data.groupby(['Batter', 'Hit_Out']).count()['Count']).reset_index()
proportions2 = player_avg_df2.pivot(index='Batter', columns='Hit_Out', values='Count').fillna(0)
# Calculate the normal AVG per batter
proportions2['AVG'] = proportions2['Hit'] / (proportions2['Hit'] + proportions2['Out'])
proportions2 = proportions2.reset_index()
compare_df2 = pd.merge(compare_df2, proportions2, on='Batter', how='inner')
compare_df2 = compare_df2.drop(columns = ['Hit', 'Out'])
# Create final dataset
ab_df2 = pd.DataFrame(new_pitch_data.groupby('Batter').count()['Count']).rename(columns={'Count': 'AB'})
compare_df2 = pd.merge(compare_df2, ab_df2, on = 'Batter', how='inner')
compare_df2['Batter'] = compare_df2['Batter'].apply(lambda x: ' '.join(x.split(' ', 2)[:2]))
compare_df2
Accuracy: 0.74
Classification Report:
precision recall f1-score support
0 0.75 0.85 0.80 115
1 0.74 0.59 0.66 81
accuracy 0.74 196
macro avg 0.74 0.72 0.73 196
weighted avg 0.74 0.74 0.74 196
Confusion Matrix:
[[98 17]
[33 48]]
| Batter | xAVG | New_xAVG_gb | AVG | AB | |
|---|---|---|---|---|---|
| 0 | A. Parker | 0.285472 | 0.396526 | 0.416667 | 108 |
| 1 | B. Mortensen | 0.286111 | 0.419291 | 0.430556 | 72 |
| 2 | C. Kirtley | 0.276450 | 0.369163 | 0.360000 | 100 |
| 3 | C. Nunez | 0.245736 | 0.389942 | 0.377358 | 106 |
| 4 | I. Brethowr | 0.387163 | 0.468026 | 0.511628 | 86 |
| 5 | J. Brown | 0.013000 | 0.059848 | 0.000000 | 1 |
| 6 | J. Newman | 0.331318 | 0.442459 | 0.386364 | 44 |
| 7 | J. Sebring | 0.356365 | 0.463254 | 0.442308 | 52 |
| 8 | J. Sundstrom | 0.370966 | 0.443256 | 0.454545 | 88 |
| 9 | J. Trimble | 0.484083 | 0.521185 | 0.416667 | 12 |
| 10 | J. Williams | 0.249800 | 0.318596 | 0.300000 | 10 |
| 11 | L. Mccollum | 0.314000 | 0.428821 | 0.473118 | 93 |
| 12 | L. Mosby | 0.198778 | 0.251369 | 0.333333 | 9 |
| 13 | N. Oakley | 0.256420 | 0.354417 | 0.289855 | 69 |
| 14 | Z. Darby | 0.260111 | 0.392921 | 0.404762 | 126 |
# Plotting
fig2 = px.bar(
compare_df2,
x='Batter',
y=['New_xAVG_gb', 'AVG', 'xAVG'],
labels={'value': 'Values', 'variable': 'Statistics'},
width=1000,
height=600,
color_discrete_map={
'New_xAVG_gb': '#3498db',
'AVG': '#F5E834',
'xAVG': 'lightgrey'
},
)
fig2.update_layout(
barmode='group',
xaxis=dict(
tickangle=55,
tickmode='array',
tickvals=list(range(len(compare_df2['Batter']))),
ticktext=compare_df['Batter'],
tickfont=dict(size=15, color='white', family='Arial')
),
paper_bgcolor='rgba(17,17,17,17)',
plot_bgcolor='rgba(17,17,17,17)',
title=dict(text='<b>Comparison between New xAVG, AVG, and xAVG for Gradient Boosted Model</b>', font=dict(color='white', size=18),x=0.45,y=0.93, xanchor='center'),
legend=dict(title=dict(font=dict(color='white'))),
yaxis=dict(tickfont=dict(color='white')),
)
fig2.update_traces(marker_line_color='black', marker_line_width=2)
fig2.show()
We see from our plots that our New xAVG does much better at giving a players xAVG over the season. Before we saw that almost every player was being undervalued for their xAVG which was not an accurate representation of play on the field.
Our analysis indicates that the Random Forest and Gradient Boosted models performed exceptionally well compared to other models. This aligns with our expectations, as these models are known for their effectiveness. Notably, our new metric for xAVG showed superior performance compared to the traditional xAVG used by Trumedia.
Looking ahead, we are committed to utilizing this advanced model to predict players' xAVG during the upcoming baseball season for the Gauchos. One challenge we faced was the limited availability of pitch-by-pitch data for all our games. Gathering more data would certainly enhance the model's performance once we have access to it.
This predictive tool not only provides a comprehensive assessment of a player's performance but also offers valuable insights to our coaching staff. By analyzing a player's xAVG, our coaching team gains a more nuanced perspective on a player's actual effectiveness, regardless of their official batting average (AVG). This is especially important when a player's AVG might appear low, but their xAVG is significantly higher. For example, if a player maintains an AVG of .200 but has an xAVG of .370, we can confidently conclude that, statistically, their performance exceeds what the AVG alone suggests. This nuanced approach equips our coaching staff with a more accurate metric for evaluating player performance, enabling informed decisions and strategic insights throughout the season.